CBOW (Continuous Bag of Words) Embeddings for News Text#

This notebook walks through a CBOW embedding implementation using Word2Vec to analyze news text. CBOW is one of the two Word2Vec architectures; it predicts a target word from the context words surrounding it.

Objectives:#

  • Build embedding vectors for the words in a news dataset

  • Use Word2Vec with the CBOW architecture

  • Extract numeric features from the text for further analysis

1. Library Installation#

Install the required libraries:

  • plotly: for interactive visualization

  • gensim: the main library for Word2Vec and embeddings

%%capture
!pip install plotly
!pip install --upgrade gensim

2. Import Libraries and Load Data#

Import the required libraries and load the preprocessed news dataset:

  • gensim.models: for Word2Vec and FastText

  • pandas: for data manipulation

  • sklearn.decomposition.PCA: for dimensionality reduction

  • matplotlib and plotly: for visualization

  • numpy: for numerical operations

from gensim.models import Word2Vec, FastText
import pandas as pd
import re

from sklearn.decomposition import PCA

from matplotlib import pyplot as plt
import plotly.graph_objects as go

import numpy as np

import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('hasil_preprocessing_berita.csv')
df
isi hasil_preprocessing kategori
0 TNImasih mempertimbangkan langkah hukum yang a... ['tnimasih', 'timbang', 'langkah', 'hukum', 'a... nasional
1 Politikus Partai GerindraRahayu Saraswati Djoj... ['politikus', 'partai', 'gerindrarahayu', 'sar... nasional
2 Staf Khusus Gubernur DKI Jakarta Bidang Komuni... ['staf', 'khusus', 'gubernur', 'dki', 'jakarta... nasional
3 Politikus Partai Gerindrayang juga keponakan P... ['politikus', 'partai', 'gerindrayang', 'kepon... nasional
4 Keponakan Presiden Prabowo Subianto,Rahayu Sar... ['keponakan', 'presiden', 'prabowo', 'subianto... nasional
... ... ... ...
146 DKI evaluasi cakupanimunisasicampak hingga tin... ['dki', 'evaluasi', 'cakupanimunisasicampak', ... gaya-hidup
147 Situasi negara saat ini tak pelak bikinstres. ... ['situasi', 'negara', 'pelak', 'bikinstres', '... gaya-hidup
148 Banyak pakarkesehatanmenganjurkan konsumsisayu... ['pakarkesehatanmenganjurkan', 'konsumsisayura... gaya-hidup
149 MaskapaiRyanair mengimbau para penumpang yang ... ['maskapairyanair', 'imbau', 'tumpang', 'alami... gaya-hidup
150 Mengenal gejala dan penanganandiabetespada ana... ['kenal', 'gejala', 'penanganandiabetespada', ... gaya-hidup

151 rows × 3 columns

3. Custom Class Definitions#

MyTokenizer#

A class for simple text tokenization:

  • Converts text to lowercase

  • Splits words on whitespace

MeanEmbeddingVectorizer#

A class that converts text into embedding vectors:

  • Uses a trained Word2Vec model

  • Averages the word vectors of each document

  • Handles words that are missing from the vocabulary

from gensim.models import Word2Vec

4. Text Preprocessing#

Clean the news text by:

  1. Converting to lowercase: normalizes the text format

  2. Removing punctuation: strips punctuation and non-alphabetic characters

  3. Removing HTML tags: cleans up any HTML markup

  4. Removing digits and special characters: strips numbers and remaining non-alphabetic characters

The preprocessed result is stored in the 'clean' column.

import numpy as np

class MyTokenizer:
    def fit_transform(self, texts):
        # Simple tokenization: lowercase + whitespace split
        return [str(text).lower().split() for text in texts]

class MeanEmbeddingVectorizer:
    def __init__(self, word2vec_model):
        self.word2vec = word2vec_model
        # Note: use vector_size (Gensim ≥ 4.0)
        self.dim = word2vec_model.wv.vector_size

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_tokenized = MyTokenizer().fit_transform(X)
        embeddings = []
        for words in X_tokenized:
            # Keep only vectors for words present in the vocabulary
            valid_vectors = [
                self.word2vec.wv[word] for word in words
                if word in self.word2vec.wv
            ]
            if valid_vectors:
                embeddings.append(np.mean(valid_vectors, axis=0))
            else:
                embeddings.append(np.zeros(self.dim))
        return np.array(embeddings)

    def fit_transform(self, X, y=None):
        return self.transform(X)

5. Corpus Construction and Word2Vec Training#

Corpus Construction#

  • Split the cleaned text into lists of words

  • Each document becomes a separate list of words

Training the Word2Vec Model#

  • Architecture: CBOW (the Word2Vec default)

  • min_count=1: include every word (even those that appear only once)

  • vector_size=56: 56-dimensional embedding vectors

  • The model learns a vector representation for each word based on its context

clean_txt = []
for w in range(len(df['hasil_preprocessing'])):
    desc = str(df['hasil_preprocessing'][w]).lower()

    # remove HTML tags first, before the punctuation filter strips the angle brackets
    desc = re.sub(r"</?.*?>", " ", desc)

    # remove punctuation and other non-alphabetic characters
    desc = re.sub(r"[^a-zA-Z]", " ", desc)

    # collapse digits, stray characters, and whitespace runs into single spaces
    desc = re.sub(r"(\d|\W)+", " ", desc)
    clean_txt.append(desc)

df['clean'] = clean_txt
df.head()
isi hasil_preprocessing kategori clean
0 TNImasih mempertimbangkan langkah hukum yang a... ['tnimasih', 'timbang', 'langkah', 'hukum', 'a... nasional tnimasih timbang langkah hukum ambil ceo mala...
1 Politikus Partai GerindraRahayu Saraswati Djoj... ['politikus', 'partai', 'gerindrarahayu', 'sar... nasional politikus partai gerindrarahayu saraswati djo...
2 Staf Khusus Gubernur DKI Jakarta Bidang Komuni... ['staf', 'khusus', 'gubernur', 'dki', 'jakarta... nasional staf khusus gubernur dki jakarta bidang komun...
3 Politikus Partai Gerindrayang juga keponakan P... ['politikus', 'partai', 'gerindrayang', 'kepon... nasional politikus partai gerindrayang keponakan presi...
4 Keponakan Presiden Prabowo Subianto,Rahayu Sar... ['keponakan', 'presiden', 'prabowo', 'subianto... nasional keponakan presiden prabowo subiantorahayu sar...

6. Exploring the Word2Vec Model#

Word Similarity Analysis#

  • most_similar(): finds the words most similar to a probe word

  • most_similar_cosmul(): finds words similar to a combination of positive and negative words

  • doesnt_match(): finds the word that does not belong in a group of words

Saving the Embeddings#

  • Saves the embedding vectors in Word2Vec format

  • File: berita_embd.txt (text format, not binary)

df.shape
(151, 4)

7. Document Embedding Extraction#

Use MeanEmbeddingVectorizer to turn each document into a vector:

  • Input: the cleaned document text

  • Process:

    1. Tokenize the text into words

    2. Look up the embedding vector of each word in the Word2Vec model

    3. Average the word vectors to obtain a document representation

  • Output: a 56-dimensional vector for each document

corpus = []
for col in df.clean:
    word_list = col.split(" ")
    corpus.append(word_list)

# show first value
corpus[0:1]

# generate vectors from corpus
model = Word2Vec(corpus, min_count=1, vector_size=56)

8. Embedding Validation#

Check the embedding lengths for consistency:

  • Every document must have a vector of length 56 (matching vector_size)

  • This confirms that the embedding process ran correctly

# Explore embeddings safely using an in-vocabulary token
# Pick a common Indonesian token if available, else fallback to the first vocab token
candidate_tokens = ['indonesia', 'pemerintah', 'jakarta', 'presiden', 'ekonomi']
probe = None
for tok in candidate_tokens:
    if tok in model.wv:
        probe = tok
        break
if probe is None:
    probe = model.wv.index_to_key[0]

print('Probe token:', probe)
print('Top similar:')
print(model.wv.most_similar(probe)[:10])

# Optional: cosine mul example if tokens exist
pos = [t for t in ['pemerintah', 'indonesia'] if t in model.wv]
neg = [t for t in ['oposisi'] if t in model.wv]
if pos:
    print('Cosmul example:')
    print(model.wv.most_similar_cosmul(positive=pos, negative=neg)[:10])

# Optional: doesnt_match example when enough tokens exist
cands = [t for t in ['ekonomi', 'politik', 'olahraga', 'jakarta'] if t in model.wv]
if len(cands) >= 3:
    print('Odd-one-out:')
    print(model.wv.doesnt_match(cands))

# Save embeddings
filename = 'berita_embd.txt'
model.wv.save_word2vec_format(filename, binary=False)
Probe token: indonesia
Top similar:
[('jalan', 0.994790256023407), ('persen', 0.9943721890449524), ('dukung', 0.9940369725227356), ('perintah', 0.9933809638023376), ('salah', 0.9932514429092407), ('orang', 0.9931586384773254), ('ekonomi', 0.9930185675621033), ('usaha', 0.9929376840591431), ('kali', 0.9924567341804504), ('purbaya', 0.9922690391540527)]
Cosmul example:
[('jalan', 0.9973941445350647), ('persen', 0.9971851110458374), ('dukung', 0.997017502784729), ('perintah', 0.99668949842453), ('salah', 0.996624767780304), ('orang', 0.9965783357620239), ('ekonomi', 0.9965083003044128), ('usaha', 0.9964678883552551), ('kali', 0.9962273836135864), ('purbaya', 0.99613356590271)]
Odd-one-out:
politik

9. Conversion to a DataFrame#

Convert the embedding array into a DataFrame with separate columns:

  • Input: a 2D embedding array (151 documents × 56 features)

  • Process:

    1. Create columns f1, f2, …, f56, one per dimension

    2. Fill each column with the values of the corresponding dimension

  • Output: a DataFrame with 151 rows and 56 feature columns

  • Purpose: makes the data easier to analyze and visualize
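
The loop-based conversion used later in this notebook can also be written in one step. A compact sketch, assuming a 2D array like the (151, 56) mean-embedding array (here replaced by random data):

```python
import numpy as np
import pandas as pd

# Stand-in for the (151, 56) mean-embedding array from the notebook
mean_embedded = np.random.rand(151, 56)

# One-step conversion: name the columns f1..f56 directly
embedding_df = pd.DataFrame(
    mean_embedded,
    columns=[f'f{i+1}' for i in range(mean_embedded.shape[1])],
)
print(embedding_df.shape)  # (151, 56)
```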

12. Embedding Visualization#

Add visualizations to analyze the embedding results:

  • PCA visualization: dimensionality reduction for a 2D view

  • Similarity heatmap: document-to-document similarity matrix

  • Embedding distribution: distribution of embedding values

  • Category analysis: embeddings broken down by category

mean_embedding_vectorizer = MeanEmbeddingVectorizer(model)
mean_embedded = mean_embedding_vectorizer.fit_transform(df['clean'])

10. Adding Labels (Optional)#

Try to add a label column if one is available:

  • Look for a 'kategori' column in the original DataFrame

  • If found, copy the labels into the embedding DataFrame

  • If not found, print a warning

Note: labels are needed for supervised learning or model evaluation.

df['array']=list(mean_embedded)

11. Final Results#

Process Summary:#

  1. Preprocessing: cleaned the news text

  2. Word2Vec training: built a CBOW model with 56 dimensions

  3. Embedding extraction: converted documents into numeric vectors

  4. DataFrame conversion: converted the array into tabular form

Output:#

  • Embedding DataFrame: 151 rows × 56 feature columns

  • Embedding file: berita_embd.txt (Word2Vec format)

  • Word2Vec model: ready for word-similarity analysis

Next Applications:#

  • Document clustering

  • Text classification

  • Document similarity analysis

  • Embedding visualization with PCA/t-SNE
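
As a sketch of the first follow-up application, document clustering can start directly from the feature columns. This assumes an `embedding_df` shaped like the one built above (here replaced by random data), with one cluster hoped for per news category:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Stand-in for the real embedding DataFrame (151 docs × 56 features)
rng = np.random.default_rng(0)
embedding_df = pd.DataFrame(
    rng.normal(size=(151, 56)),
    columns=[f'f{i+1}' for i in range(56)],
)

# Cluster the documents into 5 groups
kmeans = KMeans(n_clusters=5, random_state=0, n_init=10)
labels = kmeans.fit_predict(embedding_df)

print(len(labels))  # one cluster id per document
```

With real embeddings, cluster assignments could then be compared against the `kategori` column to gauge how well the averaged vectors separate topics.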

df.head(5)
isi hasil_preprocessing kategori clean array
0 TNImasih mempertimbangkan langkah hukum yang a... ['tnimasih', 'timbang', 'langkah', 'hukum', 'a... nasional tnimasih timbang langkah hukum ambil ceo mala... [-0.03424158, 0.023815494, 0.017993217, 0.0083...
1 Politikus Partai GerindraRahayu Saraswati Djoj... ['politikus', 'partai', 'gerindrarahayu', 'sar... nasional politikus partai gerindrarahayu saraswati djo... [-0.042456638, 0.030742344, 0.024186132, 0.011...
2 Staf Khusus Gubernur DKI Jakarta Bidang Komuni... ['staf', 'khusus', 'gubernur', 'dki', 'jakarta... nasional staf khusus gubernur dki jakarta bidang komun... [-0.035708997, 0.026341174, 0.019657917, 0.007...
3 Politikus Partai Gerindrayang juga keponakan P... ['politikus', 'partai', 'gerindrayang', 'kepon... nasional politikus partai gerindrayang keponakan presi... [-0.042115077, 0.03054803, 0.021272218, 0.0093...
4 Keponakan Presiden Prabowo Subianto,Rahayu Sar... ['keponakan', 'presiden', 'prabowo', 'subianto... nasional keponakan presiden prabowo subiantorahayu sar... [-0.04753819, 0.031029927, 0.026658049, 0.0126...
df['embedding_length'] = df['array'].str.len()
print(df['embedding_length'])
0      56
1      56
2      56
3      56
4      56
       ..
146    56
147    56
148    56
149    56
150    56
Name: embedding_length, Length: 151, dtype: int64
df.shape
(151, 6)
num_features = len(df['array'].iloc[0])  # assumes every list has the same length
columns = [f'f{i+1}' for i in range(num_features)]

# Initialize a dictionary to hold the data for each column
data_dict = {col: [] for col in columns}

# Loop over every row in the 'array' column
for embedding_list in df['array']:
    for i, value in enumerate(embedding_list):
        data_dict[f'f{i+1}'].append(value)

# Build the DataFrame from the dictionary
embedding_df = pd.DataFrame(data_dict)

embedding_df
f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 ... f47 f48 f49 f50 f51 f52 f53 f54 f55 f56
0 -0.034242 0.023815 0.017993 0.008343 0.001025 -0.046393 0.012217 -0.050655 -0.031033 -0.023668 ... 0.009383 0.025085 -0.005328 0.019430 0.002036 0.035565 0.000871 0.007080 -0.012785 -0.023511
1 -0.042457 0.030742 0.024186 0.011818 -0.001329 -0.062848 0.013246 -0.070055 -0.040546 -0.035038 ... 0.013874 0.031744 -0.007445 0.026577 0.001240 0.045626 -0.004341 0.009710 -0.017322 -0.031579
2 -0.035709 0.026341 0.019658 0.007437 -0.001668 -0.052641 0.012276 -0.054184 -0.033147 -0.025083 ... 0.009727 0.025435 -0.004278 0.020648 0.000802 0.037979 -0.001430 0.008824 -0.015628 -0.025066
3 -0.042115 0.030548 0.021272 0.009377 -0.003075 -0.056875 0.012645 -0.056823 -0.035181 -0.028878 ... 0.010294 0.029869 -0.008437 0.024573 0.004130 0.040517 -0.000205 0.011971 -0.014870 -0.026872
4 -0.047538 0.031030 0.026658 0.012629 -0.001332 -0.067935 0.015888 -0.074749 -0.044312 -0.037645 ... 0.015583 0.034910 -0.009198 0.030109 0.000316 0.052512 -0.004867 0.009321 -0.019838 -0.035834
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
146 -0.040215 0.029681 0.021212 0.008749 0.000038 -0.058004 0.013599 -0.062187 -0.037696 -0.032407 ... 0.012856 0.029878 -0.005884 0.026453 0.001851 0.043063 -0.000105 0.010249 -0.016052 -0.028944
147 -0.047534 0.034729 0.022151 0.012573 -0.001855 -0.061164 0.013678 -0.063166 -0.038975 -0.029292 ... 0.011725 0.035503 -0.007956 0.024593 0.005020 0.043391 0.001213 0.010205 -0.013986 -0.029877
148 -0.052395 0.037619 0.026534 0.008568 -0.003877 -0.067647 0.015677 -0.070051 -0.040592 -0.031019 ... 0.010935 0.040182 -0.009143 0.030671 0.004978 0.050077 0.000574 0.014260 -0.015499 -0.032130
149 -0.040506 0.029215 0.021303 0.007917 -0.000835 -0.056548 0.014421 -0.057189 -0.035690 -0.030028 ... 0.012104 0.032007 -0.006563 0.026857 0.002147 0.042390 0.000947 0.009823 -0.013878 -0.028389
150 -0.044656 0.033810 0.019051 0.009090 -0.000588 -0.060648 0.013711 -0.062389 -0.037996 -0.030327 ... 0.011028 0.033929 -0.006889 0.027399 0.005729 0.041514 -0.001705 0.010731 -0.016428 -0.031244

151 rows × 56 columns

# Add the label column if it is available in df
possible_labels = ['kategori']
label_col = None
for c in possible_labels:
    if c in df.columns:
        label_col = c
        break

if label_col is not None:
    embedding_df[label_col] = df[label_col].values
else:
    print('Warning: no label column found in df. Skipping label copy.')
embedding_df
f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 ... f48 f49 f50 f51 f52 f53 f54 f55 f56 kategori
0 -0.034242 0.023815 0.017993 0.008343 0.001025 -0.046393 0.012217 -0.050655 -0.031033 -0.023668 ... 0.025085 -0.005328 0.019430 0.002036 0.035565 0.000871 0.007080 -0.012785 -0.023511 nasional
1 -0.042457 0.030742 0.024186 0.011818 -0.001329 -0.062848 0.013246 -0.070055 -0.040546 -0.035038 ... 0.031744 -0.007445 0.026577 0.001240 0.045626 -0.004341 0.009710 -0.017322 -0.031579 nasional
2 -0.035709 0.026341 0.019658 0.007437 -0.001668 -0.052641 0.012276 -0.054184 -0.033147 -0.025083 ... 0.025435 -0.004278 0.020648 0.000802 0.037979 -0.001430 0.008824 -0.015628 -0.025066 nasional
3 -0.042115 0.030548 0.021272 0.009377 -0.003075 -0.056875 0.012645 -0.056823 -0.035181 -0.028878 ... 0.029869 -0.008437 0.024573 0.004130 0.040517 -0.000205 0.011971 -0.014870 -0.026872 nasional
4 -0.047538 0.031030 0.026658 0.012629 -0.001332 -0.067935 0.015888 -0.074749 -0.044312 -0.037645 ... 0.034910 -0.009198 0.030109 0.000316 0.052512 -0.004867 0.009321 -0.019838 -0.035834 nasional
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
146 -0.040215 0.029681 0.021212 0.008749 0.000038 -0.058004 0.013599 -0.062187 -0.037696 -0.032407 ... 0.029878 -0.005884 0.026453 0.001851 0.043063 -0.000105 0.010249 -0.016052 -0.028944 gaya-hidup
147 -0.047534 0.034729 0.022151 0.012573 -0.001855 -0.061164 0.013678 -0.063166 -0.038975 -0.029292 ... 0.035503 -0.007956 0.024593 0.005020 0.043391 0.001213 0.010205 -0.013986 -0.029877 gaya-hidup
148 -0.052395 0.037619 0.026534 0.008568 -0.003877 -0.067647 0.015677 -0.070051 -0.040592 -0.031019 ... 0.040182 -0.009143 0.030671 0.004978 0.050077 0.000574 0.014260 -0.015499 -0.032130 gaya-hidup
149 -0.040506 0.029215 0.021303 0.007917 -0.000835 -0.056548 0.014421 -0.057189 -0.035690 -0.030028 ... 0.032007 -0.006563 0.026857 0.002147 0.042390 0.000947 0.009823 -0.013878 -0.028389 gaya-hidup
150 -0.044656 0.033810 0.019051 0.009090 -0.000588 -0.060648 0.013711 -0.062389 -0.037996 -0.030327 ... 0.033929 -0.006889 0.027399 0.005729 0.041514 -0.001705 0.010731 -0.016428 -0.031244 gaya-hidup

151 rows × 57 columns

embedding_df.shape
(151, 57)
# 1. PCA visualization of the embeddings
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Dimensionality reduction with PCA
pca = PCA(n_components=2)
embedding_2d = pca.fit_transform(embedding_df.iloc[:, :-1])  # Exclude kategori column

# Visualization with Matplotlib
plt.figure(figsize=(12, 8))
categories = embedding_df['kategori'].unique()
colors = ['red', 'blue', 'green', 'orange', 'purple']

for i, category in enumerate(categories):
    mask = embedding_df['kategori'] == category
    plt.scatter(embedding_2d[mask, 0], embedding_2d[mask, 1], 
               c=colors[i % len(colors)], label=category, alpha=0.7, s=50)

plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%} variance)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%} variance)')
plt.title('PCA Visualization of News Embeddings by Category')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Explained variance ratio: PC1={pca.explained_variance_ratio_[0]:.3f}, PC2={pca.explained_variance_ratio_[1]:.3f}")
print(f"Total explained variance: {pca.explained_variance_ratio_.sum():.3f}")
_images/39a1e914ee26ff953d002d6405140872b9f7c13788d7f7858547adba81afa779.png
Explained variance ratio: PC1=0.937, PC2=0.028
Total explained variance: 0.966
# 3. Similarity heatmap for a sample of documents
# Take a sample of 20 documents for the heatmap
sample_size = min(20, len(embedding_df))
sample_indices = np.random.choice(len(embedding_df), sample_size, replace=False)
sample_embeddings = embedding_df.iloc[sample_indices, :-1]  # Exclude kategori

# Compute cosine similarity
similarity_matrix = cosine_similarity(sample_embeddings)

# Heatmap visualization with Matplotlib
plt.figure(figsize=(10, 8))
plt.imshow(similarity_matrix, cmap='viridis', aspect='auto')
plt.colorbar(label='Cosine Similarity')
plt.title('Cosine Similarity Matrix of News Embeddings (Sample)')
plt.xlabel('Document Index')
plt.ylabel('Document Index')

# Add category labels
categories_sample = embedding_df.iloc[sample_indices]['kategori'].values
for i, cat in enumerate(categories_sample):
    plt.text(i, -0.5, cat[:3], rotation=45, ha='right', va='top', fontsize=8)

plt.tight_layout()
plt.show()

print(f"Similarity matrix shape: {similarity_matrix.shape}")
print(f"Average similarity: {similarity_matrix.mean():.3f}")
print(f"Max similarity: {similarity_matrix.max():.3f}")
print(f"Min similarity: {similarity_matrix.min():.3f}")
_images/56bbfb0e11998709ca97487d5d2a57476769aec9f92d3bc19c2572c808797c32.png
Similarity matrix shape: (20, 20)
Average similarity: 0.997
Max similarity: 1.000
Min similarity: 0.990
# 4. Embedding distribution per category
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel()

# Pick a few features to analyze
feature_cols = ['f1', 'f2', 'f3', 'f4']

for i, feature in enumerate(feature_cols):
    for category in embedding_df['kategori'].unique():
        data = embedding_df[embedding_df['kategori'] == category][feature]
        axes[i].hist(data, alpha=0.6, label=category, bins=20)
    
    axes[i].set_title(f'Distribution of {feature} by Category')
    axes[i].set_xlabel(feature)
    axes[i].set_ylabel('Frequency')
    axes[i].legend()
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
_images/e9af3792faa30a73f4872d1e505b23f7b738ddd356018a81303ab795e421d9d9.png
# 5. Word similarity analysis with Word2Vec
# Take some words that exist in the vocabulary
vocab_words = list(model.wv.key_to_index.keys())[:20]  # first 20 words

# Compute the similarity matrix for these words
word_similarities = []
for word1 in vocab_words:
    row = []
    for word2 in vocab_words:
        if word1 in model.wv and word2 in model.wv:
            similarity = model.wv.similarity(word1, word2)
            row.append(similarity)
        else:
            row.append(0)
    word_similarities.append(row)

word_similarities = np.array(word_similarities)

# Word similarity heatmap visualization
plt.figure(figsize=(12, 10))
plt.imshow(word_similarities, cmap='viridis', aspect='auto')
plt.colorbar(label='Word Similarity')
plt.title('Word Similarity Matrix (Word2Vec)')
plt.xlabel('Words')
plt.ylabel('Words')

# Set labels
plt.xticks(range(len(vocab_words)), vocab_words, rotation=45, ha='right')
plt.yticks(range(len(vocab_words)), vocab_words)

plt.tight_layout()
plt.show()

print(f"Vocabulary size: {len(model.wv.key_to_index)}")
print(f"Sample words: {vocab_words[:10]}")
_images/526bea6b70d5434fdb3583734113cb10c9e40d31c80e0a58844eee13963625c3.png
Vocabulary size: 5847
Sample words: ['', 'iphone', 'indonesia', 'to', 'scroll', 'with', 'content', 'continue', 'advertisement', 'menteri']
# Test Plotly after installing nbformat
import plotly.express as px
import pandas as pd
import numpy as np

# Build simple test data
test_data = pd.DataFrame({
    'x': np.random.randn(10),
    'y': np.random.randn(10),
    'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C']
})

# Test plotly
fig = px.scatter(test_data, x='x', y='y', color='category', title='Test Plotly')
fig.show()

print("✅ Plotly ran successfully! The nbformat error is resolved.")
✅ Plotly ran successfully! The nbformat error is resolved.
# Fix 2: reinstall the libraries from inside the notebook
import sys
!{sys.executable} -m pip install --upgrade nbformat ipython
Requirement already satisfied: nbformat in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (5.10.4)
Requirement already satisfied: ipython in c:\users\user\appdata\roaming\python\python311\site-packages (9.6.0)
Requirement already satisfied: fastjsonschema>=2.15 in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (from nbformat) (2.21.2)
Requirement already satisfied: jsonschema>=2.6 in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (from nbformat) (4.25.1)
Requirement already satisfied: jupyter-core!=5.0.*,>=4.12 in c:\users\user\appdata\roaming\python\python311\site-packages (from nbformat) (5.8.1)
Requirement already satisfied: traitlets>=5.1 in c:\users\user\appdata\roaming\python\python311\site-packages (from nbformat) (5.14.3)
Requirement already satisfied: colorama in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (from ipython) (0.4.6)
Requirement already satisfied: decorator in c:\users\user\appdata\roaming\python\python311\site-packages (from ipython) (5.2.1)
Requirement already satisfied: ipython-pygments-lexers in c:\users\user\appdata\roaming\python\python311\site-packages (from ipython) (1.1.1)
Requirement already satisfied: jedi>=0.16 in c:\users\user\appdata\roaming\python\python311\site-packages (from ipython) (0.19.2)
Requirement already satisfied: matplotlib-inline in c:\users\user\appdata\roaming\python\python311\site-packages (from ipython) (0.1.7)
Requirement already satisfied: prompt_toolkit<3.1.0,>=3.0.41 in c:\users\user\appdata\roaming\python\python311\site-packages (from ipython) (3.0.52)
Requirement already satisfied: pygments>=2.4.0 in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (from ipython) (2.19.2)
Requirement already satisfied: stack_data in c:\users\user\appdata\roaming\python\python311\site-packages (from ipython) (0.6.3)
Requirement already satisfied: typing_extensions>=4.6 in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (from ipython) (4.15.0)
Requirement already satisfied: wcwidth in c:\users\user\appdata\roaming\python\python311\site-packages (from prompt_toolkit<3.1.0,>=3.0.41->ipython) (0.2.14)
Requirement already satisfied: parso<0.9.0,>=0.8.4 in c:\users\user\appdata\roaming\python\python311\site-packages (from jedi>=0.16->ipython) (0.8.5)
Requirement already satisfied: attrs>=22.2.0 in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (from jsonschema>=2.6->nbformat) (25.3.0)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (from jsonschema>=2.6->nbformat) (2025.9.1)
Requirement already satisfied: referencing>=0.28.4 in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (from jsonschema>=2.6->nbformat) (0.36.2)
Requirement already satisfied: rpds-py>=0.7.1 in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (from jsonschema>=2.6->nbformat) (0.27.1)
Requirement already satisfied: platformdirs>=2.5 in c:\users\user\appdata\roaming\python\python311\site-packages (from jupyter-core!=5.0.*,>=4.12->nbformat) (4.4.0)
Requirement already satisfied: pywin32>=300 in c:\users\user\appdata\roaming\python\python311\site-packages (from jupyter-core!=5.0.*,>=4.12->nbformat) (311)
Requirement already satisfied: executing>=1.2.0 in c:\users\user\appdata\roaming\python\python311\site-packages (from stack_data->ipython) (2.2.1)
Requirement already satisfied: asttokens>=2.1.0 in c:\users\user\appdata\roaming\python\python311\site-packages (from stack_data->ipython) (3.0.0)
Requirement already satisfied: pure-eval in c:\users\user\appdata\roaming\python\python311\site-packages (from stack_data->ipython) (0.2.3)
# Fix 3: set a different Plotly renderer
import plotly.io as pio

# Try several different renderers
try:
    # Renderer for Jupyter notebooks
    pio.renderers.default = "notebook"
    print("✅ Renderer set to 'notebook'")
except Exception:
    try:
        # Renderer for the browser
        pio.renderers.default = "browser"
        print("✅ Renderer set to 'browser'")
    except Exception:
        # HTML renderer
        pio.renderers.default = "html"
        print("✅ Renderer set to 'html'")

# Test with simple data
import plotly.express as px
import pandas as pd
import numpy as np

test_data = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [2, 4, 1, 3, 5],
    'category': ['A', 'B', 'A', 'C', 'B']
})

fig = px.scatter(test_data, x='x', y='y', color='category', title='Plotly Test with the New Renderer')
fig.show()
✅ Renderer set to 'notebook'
# Fix for the similarity-heatmap error
# Import the required libraries
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import numpy as np

# Check whether embedding_df exists
try:
    # Check whether embedding_df has been created
    if 'embedding_df' not in locals():
        print("❌ Error: embedding_df has not been created. Run the previous cells first.")
    else:
        print(f"✅ embedding_df is available with shape: {embedding_df.shape}")

        # Check whether the category column exists
        if 'kategori' not in embedding_df.columns:
            print("❌ Error: column 'kategori' not found in embedding_df")
            print(f"Available columns: {list(embedding_df.columns)}")
        else:
            print("✅ Column 'kategori' is available")

except NameError as e:
    print(f"❌ Error: {e}")
    print("Make sure all previous cells have run successfully.")
✅ embedding_df is available with shape: (151, 57)
✅ Column 'kategori' is available
# Similarity heatmap, rewritten with proper error handling
def create_similarity_heatmap(embedding_df, sample_size=20):
    """
    Build a similarity heatmap with robust error handling.

    Parameters:
    - embedding_df: DataFrame containing the embeddings
    - sample_size: number of documents to sample for the heatmap (default: 20)
    """

    try:
        # Import the required libraries
        from sklearn.metrics.pairwise import cosine_similarity
        import matplotlib.pyplot as plt
        import numpy as np

        # Check that a DataFrame was actually passed in
        if embedding_df is None:
            print("❌ Error: embedding_df not found")
            return None

        # Check that it contains data
        if len(embedding_df) == 0:
            print("❌ Error: embedding_df is empty")
            return None

        # Determine the feature columns (exclude non-numeric columns)
        feature_cols = [col for col in embedding_df.columns if col.startswith('f')]
        if len(feature_cols) == 0:
            print("❌ Error: no feature columns (f1, f2, ...) found")
            return None

        print(f"✅ Found {len(feature_cols)} feature columns")

        # Sample documents
        sample_size = min(sample_size, len(embedding_df))
        sample_indices = np.random.choice(len(embedding_df), sample_size, replace=False)
        sample_embeddings = embedding_df.iloc[sample_indices][feature_cols]

        print(f"✅ Sampled {sample_size} documents for the heatmap")

        # Compute cosine similarity
        similarity_matrix = cosine_similarity(sample_embeddings)

        # Heatmap visualization
        plt.figure(figsize=(12, 10))
        plt.imshow(similarity_matrix, cmap='viridis', aspect='auto')
        plt.colorbar(label='Cosine Similarity')
        plt.title('Cosine Similarity Matrix of News Embeddings (Sample)')
        plt.xlabel('Document Index')
        plt.ylabel('Document Index')

        # Add category labels if available
        if 'kategori' in embedding_df.columns:
            categories_sample = embedding_df.iloc[sample_indices]['kategori'].values
            for i, cat in enumerate(categories_sample):
                plt.text(i, -0.5, str(cat)[:3], rotation=45, ha='right', va='top', fontsize=8)
            plt.text(0, -1.5, "Category:", fontsize=10, fontweight='bold')
        else:
            print("⚠️ Column 'kategori' not found; heatmap drawn without category labels")

        plt.tight_layout()
        plt.show()

        # Similarity statistics
        print(f"\n📊 Similarity Matrix Statistics:")
        print(f"   Shape: {similarity_matrix.shape}")
        print(f"   Average similarity: {similarity_matrix.mean():.3f}")
        print(f"   Max similarity: {similarity_matrix.max():.3f}")
        print(f"   Min similarity: {similarity_matrix.min():.3f}")

        # Average similarity excluding the diagonal (self-similarity)
        mask = ~np.eye(similarity_matrix.shape[0], dtype=bool)
        off_diagonal_similarities = similarity_matrix[mask]
        print(f"   Average similarity (excluding diagonal): {off_diagonal_similarities.mean():.3f}")

        return similarity_matrix

    except Exception as e:
        print(f"❌ Error while building the similarity heatmap: {str(e)}")
        print("Make sure all libraries are installed and the data is ready")
        return None

# Run the function
print("🚀 Building the similarity heatmap...")
similarity_matrix = create_similarity_heatmap(embedding_df, sample_size=20)
🚀 Building the similarity heatmap...
✅ Found 56 feature columns
✅ Sampled 20 documents for the heatmap
_images/1ae5b604259f5f6bd5783c3897af3fb2bff3b9c57e93e7179cb37286ee622fd1.png
📊 Similarity Matrix Statistics:
   Shape: (20, 20)
   Average similarity: 0.996
   Max similarity: 1.000
   Min similarity: 0.980
   Average similarity (excluding diagonal): 0.996